Introduction and Taxonomy
The survey focuses on evaluating large language models (LLMs), specifically their capabilities and their alignment with human values. The taxonomy and roadmap of the survey are as follows:
- Introduction: Introduces the concept of machine intelligence and the need for evaluating LLMs.
- Taxonomy and Roadmap: Presents a taxonomy framework for evaluating LLMs, which includes five fundamental domains: Knowledge and Capability Evaluation, Alignment Evaluation, Safety Evaluation, Specialized LLMs Evaluation, and Evaluation Organization.
- Knowledge and Capability Evaluation: Discusses the evaluation of LLMs’ knowledge and reasoning capabilities, including question answering, knowledge completion, reasoning (common sense, logical, multi-hop, and mathematical), and tool learning.
- Alignment Evaluation: Focuses on evaluating the alignment of LLMs with human values, covering ethics and morality, bias detection, toxicity assessment, and truthfulness evaluation.
- Safety Evaluation: Explores the evaluation of LLMs’ robustness and the assessment of risks associated with their behaviors and potential misuse.
- Specialized LLMs Evaluation: Examines the evaluation of LLMs in specialized domains such as biology and medicine, education, legislation, computer science, and finance.
- Evaluation Organization: Provides an overview of existing benchmarks and evaluation methodologies for LLMs, including benchmarks for natural language understanding and generation, knowledge and reasoning, and holistic evaluation.
- Future Directions: Discusses future research directions, including risk evaluation, agent evaluation, dynamic evaluation, and enhancement-oriented evaluation for LLMs.
- Conclusion: Summarizes the main findings of the survey.
The survey aims to provide a comprehensive overview of the current state of LLM evaluation research, covering both capability and alignment aspects. It expands on existing surveys by integrating insights across evaluation categories, giving a more holistic characterization of LLM evaluation. The taxonomy framework structures the survey and helps readers gain a nuanced understanding of LLM performance and challenges across diverse domains.
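As a rough illustration of how the five-domain taxonomy could be organized in practice, the sketch below encodes the domains and the sub-areas named in this summary as a plain Python mapping. The names `TAXONOMY` and `domains_covering` are hypothetical and chosen only for illustration; the survey itself does not define any code or data format, and the two Safety Evaluation entries are paraphrased from the summary's wording.

```python
from __future__ import annotations

# Hypothetical encoding of the taxonomy described in this summary.
# Keys are the five evaluation domains; values are the sub-areas the
# summary lists for each domain. This is not an artifact of the survey.
TAXONOMY: dict[str, list[str]] = {
    "Knowledge and Capability Evaluation": [
        "question answering",
        "knowledge completion",
        "reasoning",
        "tool learning",
    ],
    "Alignment Evaluation": [
        "ethics and morality",
        "bias detection",
        "toxicity assessment",
        "truthfulness evaluation",
    ],
    "Safety Evaluation": [
        "robustness evaluation",
        "risk evaluation",
    ],
    "Specialized LLMs Evaluation": [
        "biology and medicine",
        "education",
        "legislation",
        "computer science",
        "finance",
    ],
    "Evaluation Organization": [
        "natural language understanding and generation benchmarks",
        "knowledge and reasoning benchmarks",
        "holistic evaluation benchmarks",
    ],
}


def domains_covering(topic: str) -> list[str]:
    """Return the domains whose listed sub-areas mention `topic` (case-insensitive)."""
    needle = topic.lower()
    return [
        domain
        for domain, sub_areas in TAXONOMY.items()
        if any(needle in sub_area.lower() for sub_area in sub_areas)
    ]
```

A lookup such as `domains_covering("toxicity")` would point a reader to the domain under which the survey discusses that topic. A flat mapping is used here because the summary lists only one level of sub-areas per domain.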